Use tilized dram-interleaved as default input-output layout #1744
Conversation
Force-pushed from 1a31acb to d4c5383
Reviewed the compiler portion, will take a look at the runtime part later today!
currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser;
currentL1UsagePerOp.erase(op);
}
@fbajraktariTT, can you review this file?
FYI @odjuricicTT, since @fbajraktariTT recently completed his internship.
@jnie-TT I'm not sure that this extra logic is needed. Was a test failing without this temp fix? If so, can you provide more details?
@odjuricicTT there's an assert below that checks whether currentL1Usage is 0. This error only surfaces with my changes - it's fine without them because we always untilize (to_layout) before returning. With my changes, however, we may now return directly, with no intermediate ops between the current op and the return op, and that causes issues because currentL1Usage would never be zeroed out.
Since this function doesn't decrement L1 usage on the return op, the assert fires and reports that the L1 usage is non-zero. My change basically adds a check: if the consumer op is a return op, we decrement the L1 usage.
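For context, a minimal sketch of the kind of check being described; the map value type and surrounding analysis are assumptions modeled on the snippet quoted above, not the actual tt-mlir code:
#include <cstdint>
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/DenseMap.h"
// Hypothetical value type mirroring currentL1UsagePerOp[op].l1MemUsagePerUser
// from the snippet above.
struct L1UsageEntry {
  std::uint64_t l1MemUsagePerUser = 0;
};
// If `op` is consumed directly by a func.return, release its L1 allocation.
// The return op is never scheduled as a consumer, so without this the final
// assert(currentL1Usage == 0) would fire.
static void releaseL1IfConsumedByReturn(
    mlir::Operation *op, std::uint64_t &currentL1Usage,
    llvm::DenseMap<mlir::Operation *, L1UsageEntry> &currentL1UsagePerOp) {
  for (mlir::Operation *user : op->getUsers()) {
    if (mlir::isa<mlir::func::ReturnOp>(user)) {
      currentL1Usage -= currentL1UsagePerOp[op].l1MemUsagePerUser;
      currentL1UsagePerOp.erase(op);
      break;
    }
  }
}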
@jnie-TT Thanks! Your solution is fine for now. Just please file the followup issue and reference it in the comment.
// TTNN Reshape does not support implicit tilization/untilization
// Therefore input output layouts should be the same
if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) { |
I feel like we should have attributes on the op that denote these kinds of capabilities instead of having this code be special-cased for a specific op. @sdjordjevicTT thoughts?
Perhaps we should add an interface to all TTNN ops called shouldTilize that defaults to true and that ops can specialize.
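A hedged illustration of the idea (not an existing tt-mlir interface); in practice this would likely be an OpInterface with a default implementation that individual ops override, but a free-function sketch shows the intent:
// Default: TTNN ops are assumed to accept tiled operands; ops that require
// row-major operands (e.g. the Reshape case above) would opt out.
inline bool shouldTilize(mlir::Operation *operation, unsigned operandNumber) {
  if (mlir::isa<ttir::ReshapeOp>(operation) && operandNumber == 1) {
    return false;
  }
  return true;
}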
Yeah, that would be awesome to have. I know a lot of eltwise ops face a similar issue regarding data type: some ops can typecast implicitly whereas others cannot, which results in the IR being misaligned with the actual runtime output.
I am thinking about these scenarios. Do we have some examples?
Aren't these examples? i.e. reshape, conv2d, slice & embedding, or do you mean something else?
@sdjordjevicTT if you mean the implicit typecast ops, an example would be relational binary ops vs unary ops.
Relational operations take an output_dtype that we set to typecast implicitly within the op:
template <BinaryOpType binary_op_type>
struct RelationalBinary {
static Tensor invoke(
uint8_t queue_id,
const Tensor &input_tensor_a_arg,
const Tensor &input_tensor_b_arg,
const std::optional<const DataType> &output_dtype = std::nullopt,
const std::optional<MemoryConfig> &memory_config = std::nullopt,
std::optional<Tensor> optional_output_tensor = std::nullopt,
std::optional<unary::FusedActivations> activations = std::nullopt,
std::optional<unary::UnaryWithParam> input_tensor_a_activation = std::nullopt);
However unary ops do not:
template <UnaryOpType... unary_op_types>
Tensor ExecuteUnary<unary_op_types...>::invoke(
const Tensor& input_tensor,
const std::optional<MemoryConfig>& memory_config,
const std::optional<Tensor>& optional_output_tensor) {
And our compiler doesn't distinguish between them, i.e. for unary ops it'll still assume the output tensor of the unary op is properly typecasted to the desired data type.
As for ops that don't support implicit tilization/untilization, some examples include reshape, concat, transpose.
I believe there was a misunderstanding between us. :)
Regarding Conv, Slice, and Embedding, I'm aware that they require some inputs to be in a row-major layout. I'll address this by implementing the necessary layout workarounds. If the Metal developers decide not to support tile layout for them, then we can introduce a trait/interface to accommodate them.
Regarding the implicit conversions, I get it for the data type, but how are we specifying whether the output is in tile/row-major layout? By defining the optional_output_tensor? I see what the issue can be: if you have a row-major input, you want to keep the output row-major for such ops. We can think about adding the interface at the op level to support this.
I created issues for myself to follow up on this:
if (mlir::isa<ttir::Conv2dOp>(operation) ||
    mlir::isa<ttir::SliceOp>(operation) ||
    (mlir::isa<ttir::EmbeddingBackwardOp>(operation) &&
     operandNumber < 2)) {
Same as above.
This will be cleaned up with the workarounds. We have tasks for each of these to clean up.
Force-pushed from d4c5383 to 676c714
Force-pushed from 676c714 to a9a8eff
A few comments on Optimizer related changes, but looks good overall.
Requesting changes until optimizer layout overrides are fixed. I'll help with this.
// API can determine the correct devices. Enabling this workaround will assume
// that a device tensor will reside in the L1/Dram of the first device (device
// id 0) of the device grid. This should be removed once we add the device
// grid information to the tensorDesc. |
So there is a strategy field on LayoutDesc that will be set to ::tt::target::DistributedTensorConfig::NONE for a single-chip setup, or to some kind of multi-device distribution otherwise. LMK if this doesn't resolve the issue.
table LayoutDesc {
  stride: [int];
  oob_val: OOBVal;
  core_range_set: [Dim2dRange];
  memory_desc: MemoryDesc;
  strategy: DistributionStrategy;
}
Reach out to @wooseokTT if you need help interpreting how it's programmed.
@nsmithtt the strategy doesn't tell us which submesh a tensor belongs to though, right? I remember that when I added it, it was used to specify the tensor distribution method across multiple devices (replicate, shard, etc.).
I can use it to distinguish between single/multichip, but I don't know the mesh shape or mesh offset that the tensor is mapped to if it's multidevice, and I need this info if I want to move a tensor to a multidevice mesh in the toLayout API.
It depends on the strategy, but e.g. ShardTensor2D.shard_mesh does tell you the mesh shape. I think the offset is always implicitly inferred to be [0, 0]; @wooseokTT feel free to correct me if I'm wrong, but this reflects the TTNN API, which doesn't support arbitrary mesh offsets.
Yeah, you're right, ShardTensor2D has the shard_mesh, but it seems like the other strategies don't... If all we're using is ShardTensor2D with offset [0, 0], then I guess I can just derive the mesh shape from that, and maybe add an assert that the strategy must be ShardTensor2D. Does doing it this way make sense with how we're performing multichip operations?
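A rough sketch of that derivation, assuming standard flatbuffers-generated accessors for the LayoutDesc schema shown above (strategy(), strategy_type(), strategy_as_ShardTensor2D(), and a shard_mesh with y/x fields); these names are assumptions and would need to be checked against the generated headers:
#include <cassert>
#include <cstdint>
#include <utility>
// Hedged sketch: derive the mesh shape from the tensor's distribution strategy,
// treating NONE as single chip and asserting ShardTensor2D otherwise.
static std::pair<uint32_t, uint32_t>
getMeshShape(const ::tt::target::LayoutDesc *layout) {
  const auto *strategy = layout->strategy();
  if (!strategy ||
      strategy->strategy_type() == ::tt::target::DistributedTensorConfig::NONE) {
    return {1, 1}; // single chip; offset implicitly [0, 0]
  }
  assert(strategy->strategy_type() ==
             ::tt::target::DistributedTensorConfig::ShardTensor2D &&
         "only ShardTensor2D distribution is currently expected");
  const auto *shard2d = strategy->strategy_as_ShardTensor2D();
  return {static_cast<uint32_t>(shard2d->shard_mesh()->y()),
          static_cast<uint32_t>(shard2d->shard_mesh()->x())};
}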
@jnie-TT I'm preparing for this change in tt-forge-fe, so I've tested this branch. The only issue I've observed is not related directly to your change, but it has been exposed by it.
For tilized tensors we can have cases where FEs will get wrong stride information (when trying to allocate buffers for them).
The problem occurs when serializing the layout into the flatbuffer, and it's due to the fact that we have tied stride calculation to the layout attribute. As currently implemented, the same layout attribute can produce different strides (depending on the logical shape of the tensor), so we end up with a problem when serializing into the flatbuffer because of the way the caching mechanism works there (all tensors with the same layout attribute will be serialized exactly the same).
E.g., all tensors with layout #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x1x!tt.tile<32x32, bf16>, #dram>, <interleaved>> will end up having the same strides, even though they can have different logical shapes.
Note: I am not sure if getting the stride for tilized tensors even makes sense, so we might want to introduce a different mechanism.
Yikes!! That's a good catch. It seems our options are:
Open to additional ideas/alternatives.
@pilkicTT, what is the fallout of the stride caching bug?
Optimizer changes look good.
I don't have the numbers, but for now we can work around it by assuming a "uniform" stride and calculating it from the logical shapes. Additionally, regardless of this issue, we need to untilize the output tensors before memcpying them back. We should probably check other frontends and see if they are affected by these problems as well.
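For reference, a minimal sketch of that "uniform" stride workaround: compute row-major strides purely from the logical shape, ignoring the tiled physical layout.
#include <cstdint>
#include <vector>
// Row-major strides derived from the logical shape only.
// e.g. shape {2, 3, 4} -> strides {12, 4, 1}
std::vector<std::uint32_t> uniformStrides(const std::vector<std::uint32_t> &shape) {
  std::vector<std::uint32_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * shape[i + 1];
  }
  return strides;
}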
Maybe we should just get a proper fix in as a separate PR first. This seems like an innocuous-looking but nasty bug that would suck to rediscover during the interim period.
Let's do that, that would be great!
Force-pushed from 03dfc27 to 7434d7d
Thanks for pushing this change! A couple of comments inline, but nothing blocking.
auto device = op.getDevice();
assert((device || output.isOnHost()) &&
       "Op device must be set for output tensors on device");
if (not device and not output.isOnHost()) { |
Nit: Per the coding style guidelines, we decided not to use alternative tokens (not, and, or, etc.):
https://docs.tenstorrent.com/tt-mlir/coding-guidelines.html#using-alternative-tokens-and-or-xor-etc
Can you please switch to the regular logical operators? I have an ongoing PR that fixes this file:
#2055
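For reference, the snippet above with the alternative tokens replaced by the regular operators:
auto device = op.getDevice();
assert((device || output.isOnHost()) &&
       "Op device must be set for output tensors on device");
// Same condition as before, using ! and && per the coding guidelines.
if (!device && !output.isOnHost()) {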
Thanks!
}
continue;
}

// TTNN mesh shard expects host input and output
// TODO(jnie): This can be removed once the workaround pass can correctly
Can you please open an issue for me regarding this and how I can repro it? I would like to take a look; currently there is a canonicalization pass after the workaround pass, so I can look into it.
Issue created here: #2102
Force-pushed from 7434d7d to 1d02669
Force-pushed from 1d02669 to c5e1cca
Thanks, Jackson!
Force-pushed from c5e1cca to d2234eb
Force-pushed from d2234eb to fcc00ee
### Ticket
Fixes #233
### Problem description
With the change tenstorrent/tt-mlir#1744 on tt-mlir main, the output tensor layout is now dram-interleaved, tiled by default.
### What's changed
Prior to memcpying the runtime tensor to host memory, we need to call the `tt::runtime::toHost` API to untilize the tensor.
### Checklist
- [x] New/Existing tests provide coverage for changes
Previously, we performed the conversion from MLIR types to tt types in several places in the code. #1744 added type conversion in TTNNLayout, which converts all types to tt-supported types and makes this workaround redundant.
Description
Part of the runtime stitching effort #1743.
This PR updates the default input/output layout from row-major in system memory to tiled, dram-interleaved.
Combined with the runtime stitching APIs, this enables the user to pre-tilize and interleave tensors (such as weights) and reuse them across multiple programs, eliminating ping-ponging between host/dram and row-major/tile layouts.
IR Example
TTNN IR of simple_matmul test on main:
TTNN IR of simple_matmul test after this change:
Changes
TTNNLayout
TTIRToTTNN
Optimizer
Added a workaround that moves GetDeviceOps to the front of the op schedule.
Added a workaround that checks for ReturnOps in the L1 usage calculation.
Marked layout-forcing tests as XFail.
Runtime
MLIR Tests
TODOs Before Merging
Frontends need to add a runtime::toHost call before memcpying tensors. runtime::toHost accepts an untilize flag that will untilize the tensor (see the sketch below).
Update TODO comments once proper issues are created (optimizer, runtime workaround).
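A minimal sketch of that frontend-side pattern; the exact toHost signature and return type are assumptions based on this description rather than the verified runtime API:
// Untilize the device output on the way back to host before memcpying it
// into frontend-owned memory.
auto hostTensor = ::tt::runtime::toHost(deviceOutput, /*untilize=*/true);
// ...then memcpy hostTensor's underlying data into the frontend buffer as before.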